Dominant subspace



Sparse Quadratic Optimisation over the Stiefel Manifold with Application to Permutation Synchronisation

Bernard, Florian, Cremers, Daniel, Thunberg, Johan

Neural Information Processing Systems

We address the non-convex optimisation problem of finding a sparse matrix on the Stiefel manifold (matrices with mutually orthogonal columns of unit length) that maximises (or minimises) a quadratic objective function. Optimisation problems on the Stiefel manifold occur for example in spectral relaxations of various combinatorial problems, such as graph matching, clustering, or permutation synchronisation. Although sparsity is a desirable property in such settings, it is mostly neglected in spectral formulations since existing solvers, e.g. based on eigenvalue decomposition, are unable to account for sparsity while at the same time maintaining global optimality guarantees.


Accelerating Neural Network Training Along Sharp and Flat Directions

Zakarin, Daniyar, Singh, Sidak Pal

arXiv.org Machine Learning

Recent work has highlighted a surprising alignment between gradients and the top eigenspace of the Hessian -- termed the Dominant subspace -- during neural network training. Concurrently, there has been growing interest in the distinct roles of sharp and flat directions in the Hessian spectrum. In this work, we study Bulk-SGD, a variant of SGD that restricts updates to the orthogonal complement of the Dominant subspace. Through ablation studies, we characterize the stability properties of Bulk-SGD and identify critical hyperparameters that govern its behavior. We show that updates along the Bulk subspace, corresponding to flatter directions in the loss landscape, can accelerate convergence but may compromise stability. To balance these effects, we introduce interpolated gradient methods that unify SGD, Dom-SGD, and Bulk-SGD. Finally, we empirically connect this subspace decomposition to the Generalized Gauss-Newton and Functional Hessian terms, showing that curvature energy is largely concentrated in the Dominant subspace. Our findings suggest a principled approach to designing curvature-aware optimizers.
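The core operation in Bulk-SGD can be sketched in a few lines: compute the top-k eigenvectors of the Hessian and remove their span from the gradient. This is a minimal illustration on an explicit toy Hessian, not the paper's implementation (which would estimate the dominant subspace from minibatch curvature rather than a full eigendecomposition); the function name and the choice of `k` are illustrative.

```python
import numpy as np

def bulk_update(hessian, grad, k):
    """Project a gradient onto the orthogonal complement of the
    top-k Hessian eigenspace (the 'Bulk' direction)."""
    # eigh returns eigenvalues in ascending order; take the last k vectors
    _, vecs = np.linalg.eigh(hessian)
    V = vecs[:, -k:]                 # dominant subspace basis (d x k)
    return grad - V @ (V.T @ grad)   # remove the dominant component

# Toy quadratic with one sharp direction and two flat ones
H = np.diag([100.0, 1.0, 0.5])
g = np.array([3.0, 2.0, 1.0])
bulk = bulk_update(H, g, k=1)        # sharp component of g is stripped
```

The interpolated methods mentioned in the abstract would then take a convex combination of this Bulk component and the dominant component, recovering plain SGD at equal weights.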


I3S: Importance Sampling Subspace Selection for Low-Rank Optimization in LLM Pretraining

Zhang, Haochen, Yin, Junze, Wang, Guanchu, Liu, Zirui, Zhang, Tianyi, Shrivastava, Anshumali, Yang, Lin, Braverman, Vladimir

arXiv.org Artificial Intelligence

Low-rank optimization has emerged as a promising approach to enabling memory-efficient training of large language models (LLMs). Existing low-rank optimization methods typically project gradients onto a low-rank subspace, reducing the memory cost of storing optimizer states. A key challenge in these methods is identifying suitable subspaces to ensure an effective optimization trajectory. Most existing approaches select the dominant subspace to preserve gradient information, as this intuitively provides the best approximation. However, we find that in practice, the dominant subspace stops changing during pretraining, thereby constraining weight updates to similar subspaces. In this paper, we propose importance sampling subspace selection (I3S) for low-rank optimization, which theoretically offers a comparable convergence rate to the dominant subspace approach. Empirically, we demonstrate that I3S significantly outperforms previous methods in LLM pretraining tasks.
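The idea of importance sampling a subspace, as opposed to always keeping the top singular directions, can be sketched as follows. This is a hypothetical reading of the abstract, not the paper's algorithm: the sampling distribution (squared singular values) and the function name are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def i3s_projection(grad, k):
    """Sample a rank-k subspace with probability proportional to the
    squared singular values, rather than always taking the top k."""
    U, s, _ = np.linalg.svd(grad, full_matrices=False)
    probs = s**2 / np.sum(s**2)
    idx = rng.choice(len(s), size=k, replace=False, p=probs)
    P = U[:, idx]                  # sampled subspace basis (m x k)
    return P, P @ (P.T @ grad)     # basis and low-rank projected gradient

G = rng.standard_normal((8, 4))    # stand-in for a weight-matrix gradient
P, G_low = i3s_projection(G, k=2)
```

Because the sampled subspace changes across refresh steps, weight updates are not confined to one nearly-static dominant subspace, which is the failure mode the abstract describes.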


Does SGD really happen in tiny subspaces?

Song, Minhak, Ahn, Kwangjun, Yun, Chulhee

arXiv.org Machine Learning

Understanding the training dynamics of deep neural networks is challenging due to their high-dimensional nature and intricate loss landscapes. Recent studies have revealed that, along the training trajectory, the gradient approximately aligns with a low-rank top eigenspace of the training loss Hessian, referred to as the dominant subspace. Given this alignment, this paper explores whether neural networks can be trained within the dominant subspace, which, if feasible, could lead to more efficient training methods. Our primary observation is that when the SGD update is projected onto the dominant subspace, the training loss does not decrease further. This suggests that the observed alignment between the gradient and the dominant subspace is spurious. Surprisingly, projecting out the dominant subspace proves to be just as effective as the original update, despite removing the majority of the original update component. Similar observations are made for the large learning rate regime (also known as Edge of Stability) and Sharpness-Aware Minimization. We discuss the main causes and implications of this spurious alignment, shedding light on the intricate dynamics of neural network training.
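The alignment statistic underlying these observations is the fraction of squared gradient norm that lies in the top-k Hessian eigenspace. A minimal sketch, assuming access to an explicit Hessian (in practice this is estimated with iterative eigensolvers):

```python
import numpy as np

def dominant_alignment(hessian, grad, k):
    """Fraction of the squared gradient norm lying in the top-k
    Hessian eigenspace -- the alignment quantity studied here."""
    _, vecs = np.linalg.eigh(hessian)   # eigenvalues in ascending order
    V = vecs[:, -k:]                    # dominant subspace basis
    return float(np.sum((V.T @ grad) ** 2) / np.sum(grad ** 2))

H = np.diag([50.0, 40.0, 1.0, 0.5])
g = np.array([2.0, 1.0, 0.1, 0.1])
frac = dominant_alignment(H, g, k=2)    # close to 1: strong alignment
```

The paper's point is that a value of `frac` near 1 can be spurious: the aligned component may contribute little to loss decrease, while the small residual in the complement does the useful work.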


A Federated Data Fusion-Based Prognostic Model for Applications with Multi-Stream Incomplete Signals

Arabi, Madi, Fang, Xiaolei

arXiv.org Machine Learning

Industrial prognostics aims to predict the failure time of machines by utilizing their degradation signals. This is typically achieved by establishing a statistical learning model that maps the degradation signals of machines to their times-to-failure (TTFs) [1, 2]. As with many other statistical learning models, the implementation of prognostic models usually consists of two steps: model training and real-time monitoring (also known as model testing or deployment). Model training uses a historical dataset comprising the degradation signals and TTFs of failed machines to estimate the parameters of the prognostic model; real-time monitoring feeds the real-time degradation signals from a partially degraded onsite machine into the trained model to predict its TTF or TTF distribution. Most existing prognostic models assume that a historical dataset from a sufficient number of failed machines is available for model training [3, 4, 5, 6, 7]. In reality, however, the amount of historical data owned by a single organization (e.g., a company, a university lab, or a factory) may not be large enough to train a reliable prognostic model.


Sparse Quadratic Optimisation over the Stiefel Manifold with Application to Permutation Synchronisation

Bernard, Florian, Cremers, Daniel, Thunberg, Johan

arXiv.org Machine Learning

We address the non-convex optimisation problem of finding a sparse matrix on the Stiefel manifold (matrices with mutually orthogonal columns of unit length) that maximises (or minimises) a quadratic objective function. Optimisation problems on the Stiefel manifold occur for example in spectral relaxations of various combinatorial problems, such as graph matching, clustering, or permutation synchronisation. Although sparsity is a desirable property in such settings, it is mostly neglected in spectral formulations since existing solvers, e.g. based on eigenvalue decomposition, are unable to account for sparsity while at the same time maintaining global optimality guarantees. We fill this gap and propose a simple yet effective sparsity-promoting modification of the Orthogonal Iteration algorithm for finding the dominant eigenspace of a matrix. By doing so, we can guarantee that our method finds a Stiefel matrix that is globally optimal with respect to the quadratic objective function, while in addition being sparse. As a motivating application we consider the task of permutation synchronisation, which can be understood as a constrained clustering problem that has particular relevance for matching multiple images or 3D shapes in computer vision, computer graphics, and beyond. We demonstrate that the proposed approach outperforms previous methods in this domain.
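The baseline the paper modifies, Orthogonal Iteration, alternates multiplication by the matrix with re-orthonormalisation. Below is a sketch of that iteration with a soft-thresholding step inserted as one plausible sparsity-promoting modification; the actual modification in the paper may differ, and `lam` is an illustrative parameter (with `lam=0` this is plain Orthogonal Iteration).

```python
import numpy as np

def sparse_orthogonal_iteration(A, k, lam=0.0, iters=200, seed=0):
    """Orthogonal iteration for the dominant k-dimensional eigenspace of a
    symmetric matrix A, with an (assumed) soft-thresholding step to
    encourage sparsity; lam=0 recovers plain orthogonal iteration."""
    rng = np.random.default_rng(seed)
    X = np.linalg.qr(rng.standard_normal((A.shape[0], k)))[0]
    for _ in range(iters):
        Y = A @ X
        Y = np.sign(Y) * np.maximum(np.abs(Y) - lam, 0.0)  # promote sparsity
        X, _ = np.linalg.qr(Y)                             # re-orthonormalise
    return X

A = np.diag([5.0, 4.0, 1.0, 0.5])
X = sparse_orthogonal_iteration(A, k=2)   # converges to span{e1, e2}
```

Convergence of the plain iteration is governed by the eigenvalue gap between the k-th and (k+1)-th eigenvalues; the paper's contribution is showing that sparsity can be added without losing the global optimality guarantee of the spectral solution.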


Network topology change-point detection from graph signals with prior spectral signatures

Kaushik, Chiraag, Roddenberry, T. Mitchell, Segarra, Santiago

arXiv.org Machine Learning

We consider the problem of sequential graph topology change-point detection from graph signals. We assume that signals on the nodes of the graph are regularized by the underlying graph structure via a graph filtering model, which we then leverage to distill the graph topology change-point detection problem to a subspace detection problem. We demonstrate how prior information on the spectral signature of the post-change graph can be incorporated to implicitly denoise the observed sequential data, thus leading to a natural CUSUM-based algorithm for change-point detection. Numerical experiments illustrate the performance of our proposed approach, particularly underscoring the benefits of (potentially noisy) prior information.
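The CUSUM recursion at the heart of such detectors is standard: accumulate per-sample evidence of a change, clip at zero, and raise an alarm when the statistic crosses a threshold. This sketch uses generic scalar scores; in the paper those scores would come from the subspace-detection statistic on each incoming graph signal, which is not reproduced here.

```python
import numpy as np

def cusum(scores, threshold):
    """Classic one-sided CUSUM: accumulate positive evidence of a change,
    clip at zero, and flag the first time the statistic exceeds the
    threshold. Returns the alarm index, or None if no alarm is raised."""
    S, alarm = 0.0, None
    for t, s in enumerate(scores):
        S = max(0.0, S + s)
        if S > threshold:
            alarm = t
            break
    return alarm

# Per-sample scores: negative drift before the change, positive after it
scores = [-0.5] * 20 + [1.0] * 20
t_alarm = cusum(scores, threshold=5.0)
```

The threshold trades off detection delay against false-alarm rate; the prior spectral information in the paper effectively sharpens the scores, so the statistic grows faster after a true change.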